Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

https://doi.org/10.1145/3680279

Sharma, Nihal; Sen, Rajat; Basu, Soumya; Shanmugam, Karthikeyan; Shakkottai, Sanjay (September 2024, ACM Transactions on Modeling and Performance Evaluation of Computing Systems)

We study a variant of the contextual bandit problem where an agent can intervene through a set of stochastic expert policies. Given a fixed context, each expert samples actions from a fixed conditional distribution. The agent seeks to remain competitive with the “best” among the given set of experts. We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts and provide horizon-independent constant regret bounds that only scale linearly in the number of experts. We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions. Further, we investigate the episodic setting where the agent interacts with an environment that changes over episodes. Each episode can have different context and reward distributions resulting in the best expert changing across episodes. We show that by bootstrapping from \(\mathcal{O}(N\log(NT^2\sqrt{E}))\) samples, ED-UCB guarantees a regret that scales as \(\mathcal{O}(E(N+1) + \frac{N\sqrt{E}}{T^2})\) for \(N\) experts over \(E\) episodes, each of length \(T\). We finally empirically validate our findings through simulations.
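
To illustrate how importance sampling can share information across experts, the following is a minimal sketch, not the authors' estimator. It uses plain (unclipped) importance weights and omits the divergence-based confidence bounds that give D-UCB its name. The function name `is_estimate`, the array layout of `policies`, and the toy numbers are all assumptions made for this example.

```python
import numpy as np


def is_estimate(samples, policies, k):
    """Plain importance-sampled estimate of expert k's mean reward.

    samples:  list of (x, v, y, j) tuples: context index, action index,
              observed reward, and index of the expert that chose the action.
    policies: array of shape (n_experts, n_contexts, n_actions), where
              policies[i, x] is expert i's action distribution in context x
              (assumed known exactly and strictly positive).
    k:        index of the expert whose mean reward is being estimated.
    """
    # Reweight each sample by how much more (or less) likely expert k was
    # to choose the logged action than the expert that actually chose it.
    weights = np.array(
        [policies[k, x, v] / policies[j, x, v] for x, v, _, j in samples]
    )
    rewards = np.array([y for _, _, y, _ in samples])
    return float(np.mean(weights * rewards))


# Toy setup (hypothetical numbers): two experts, one context, two actions.
policies = np.array([
    [[0.5, 0.5]],  # expert 0 picks either action with equal probability
    [[0.8, 0.2]],  # expert 1 prefers action 0
])

# Two samples, both collected by playing expert 0; reward is 1 for action 0.
samples = [(0, 0, 1.0, 0), (0, 1, 0.0, 0)]

# Estimate expert 1's mean reward without ever playing expert 1.
estimate_for_expert_1 = is_estimate(samples, policies, k=1)
print(estimate_for_expert_1)
```

In this sketch, samples gathered under one expert inform the estimate for every other expert. That reuse is the information-sharing idea the abstract refers to.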

Full Text Available